The data relate to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).
We can see there is no missing data in this dataset.
Graph for checking missing data
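A check of this kind can be sketched in base R as follows. The data frame `bank` is a small hypothetical stand-in, not the real sample. Note that several categorical variables in this dataset use the level `unknown` (visible in the tables that follow), which `is.na()` does not count as missing, so the sketch counts those separately.

```r
# Hypothetical stand-in for the bank marketing data (not the real sample).
bank <- data.frame(
  age     = c(34, 51, 29, 45),
  job     = factor(c("admin.", "unknown", "services", "retired")),
  housing = factor(c("yes", "no", "unknown", "yes")),
  y       = factor(c("no", "yes", "no", "yes"))
)

# Count true NA values per column; all zeros means no missing data.
na_per_column <- colSums(is.na(bank))
print(na_per_column)

# Count values coded as the level "unknown", which is.na() does not flag.
unknown_per_column <- sapply(bank, function(col) sum(as.character(col) == "unknown"))
print(unknown_per_column)
```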
All categorical predictor variables vs. the dependent variable (proportion of y within each level)
##      job
## y        admin. blue-collar entrepreneur housemaid management   retired
##   no  0.4547206   0.6119017    0.5692308 0.5312500  0.5221843 0.2567050
##   yes 0.5452794   0.3880983    0.4307692 0.4687500  0.4778157 0.7432950
##      job
## y     self-employed  services   student technician unemployed   unknown
##   no      0.4963504 0.5641791 0.2753623  0.5375191  0.4181818 0.5000000
##   yes     0.5036496 0.4358209 0.7246377  0.4624809  0.5818182 0.5000000
##      marital
## y      divorced   married    single   unknown
##   no  0.5343348 0.5139665 0.4603837 0.3750000
##   yes 0.4656652 0.4860335 0.5396163 0.6250000
##      education
## y      basic.4y  basic.6y  basic.9y high.school illiterate professional.course
##   no  0.4856459 0.6287129 0.6088710   0.5106383  0.0000000           0.4872798
##   yes 0.5143541 0.3712871 0.3911290   0.4893617  1.0000000           0.5127202
##      education
## y     university.degree   unknown
##   no          0.4539830 0.4130435
##   yes         0.5460170 0.5869565
## default
## y no unknown yes
## no 0.4657942 0.6802508
## yes 0.5342058 0.3197492
##      housing
## y            no   unknown       yes
##   no  0.4994299 0.5800000 0.4967381
##   yes 0.5005701 0.4200000 0.5032619
##      loan
## y            no   unknown       yes
##   no  0.4971098 0.5800000 0.5024470
##   yes 0.5028902 0.4200000 0.4975530
##      contact
## y      cellular telephone
##   no  0.4276730 0.6818981
##   yes 0.5723270 0.3181019
##      month
## y           apr       aug       dec       jul       jun       mar       may
##   no  0.3275862 0.5136986 0.1578947 0.5757098 0.5370741 0.1428571 0.6411215
##   yes 0.6724138 0.4863014 0.8421053 0.4242902 0.4629259 0.8571429 0.3588785
##      month
## y           nov       oct       sep
##   no  0.5176768 0.1419753 0.1007752
##   yes 0.4823232 0.8580247 0.8992248
##      day_of_week
## y           fri       mon       thu       tue       wed
##   no  0.5118550 0.5703422 0.4692671 0.4529703 0.5000000
##   yes 0.4881450 0.4296578 0.5307329 0.5470297 0.5000000
##      poutcome
## y       failure nonexistent   success
##   no  0.4547325   0.5623198 0.0610687
##   yes 0.5452675   0.4376802 0.9389313
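In each table above, every column sums to 1 (for example 0.4547206 + 0.5452794 for `admin.`), so the values read as the share of `no` and `yes` within each level of the predictor. A minimal sketch of how such a table can be produced with `table()` and `prop.table()`, using a hypothetical toy data frame rather than the real sample:

```r
# Hypothetical toy data (not the real sample).
toy <- data.frame(
  y       = factor(c("no", "yes", "yes", "no", "yes", "no")),
  contact = factor(c("cellular", "cellular", "cellular",
                     "telephone", "telephone", "telephone"))
)

# Cross-tabulate outcome (rows) against predictor (columns).
counts <- table(y = toy$y, contact = toy$contact)

# margin = 2 divides each column by its total, so every column sums to 1.
col_props <- prop.table(counts, margin = 2)
print(col_props)
```

Whether the author used exactly this call is an assumption; the column-wise normalization is what the printed numbers imply.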
##### Logistic Regression Assumption Check
Type of selection: manual (intuition-based)
##
## Call:
## glm(formula = y ~ job + contact + month + campaign + pdays +
## previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed, family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7513 -0.8615 -0.6019 0.8025 1.9204
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.320e+02 8.705e+01 -3.814 0.000137 ***
## jobblue-collar -3.822e-03 1.307e-01 -0.029 0.976663
## jobentrepreneur 3.588e-02 2.469e-01 0.145 0.884460
## jobhousemaid 7.902e-02 2.861e-01 0.276 0.782418
## jobmanagement -8.257e-02 1.811e-01 -0.456 0.648444
## jobretired 1.159e-01 2.127e-01 0.545 0.585903
## jobself-employed 1.419e-01 2.606e-01 0.545 0.586016
## jobservices 1.021e-01 1.678e-01 0.608 0.543096
## jobstudent 1.204e-01 2.710e-01 0.444 0.656800
## jobtechnician -1.055e-01 1.371e-01 -0.770 0.441590
## jobunemployed 2.671e-01 2.839e-01 0.941 0.346835
## jobunknown -2.010e-01 4.852e-01 -0.414 0.678753
## contacttelephone -6.506e-01 1.684e-01 -3.864 0.000112 ***
## monthaug 7.814e-01 3.313e-01 2.358 0.018353 *
## monthdec -3.174e-01 5.342e-01 -0.594 0.552382
## monthjul 8.123e-02 2.197e-01 0.370 0.711543
## monthjun -1.077e+00 3.052e-01 -3.528 0.000418 ***
## monthmar 1.580e+00 3.764e-01 4.197 2.71e-05 ***
## monthmay -3.981e-01 1.880e-01 -2.117 0.034230 *
## monthnov -3.713e-01 2.658e-01 -1.397 0.162413
## monthoct 1.584e-01 3.669e-01 0.432 0.666055
## monthsep 9.180e-01 4.913e-01 1.869 0.061670 .
## campaign -2.066e-02 1.842e-02 -1.122 0.262045
## pdays -1.036e-03 8.363e-04 -1.239 0.215338
## previous 8.733e-02 2.103e-01 0.415 0.677964
## poutcomenonexistent 5.814e-01 2.758e-01 2.108 0.035037 *
## poutcomesuccess 8.731e-01 8.334e-01 1.048 0.294771
## emp.var.rate -1.964e+00 3.420e-01 -5.743 9.29e-09 ***
## cons.price.idx 2.726e+00 5.864e-01 4.649 3.33e-06 ***
## cons.conf.idx 5.600e-02 2.076e-02 2.698 0.006974 **
## euribor3m 9.084e-02 3.059e-01 0.297 0.766502
## nr.employed 1.538e-02 6.945e-03 2.214 0.026835 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3210.2 on 2968 degrees of freedom
## AIC: 3274.2
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ contact + month + campaign + pdays + previous +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed, family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7664 -0.8295 -0.6153 0.8030 1.9294
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.389e+02 8.668e+01 -3.910 9.23e-05 ***
## contacttelephone -6.573e-01 1.679e-01 -3.914 9.06e-05 ***
## monthaug 7.622e-01 3.298e-01 2.311 0.020811 *
## monthdec -3.211e-01 5.328e-01 -0.603 0.546708
## monthjul 7.599e-02 2.188e-01 0.347 0.728324
## monthjun -1.105e+00 3.038e-01 -3.637 0.000276 ***
## monthmar 1.565e+00 3.750e-01 4.174 3.00e-05 ***
## monthmay -4.078e-01 1.864e-01 -2.187 0.028734 *
## monthnov -3.809e-01 2.652e-01 -1.436 0.151034
## monthoct 1.435e-01 3.659e-01 0.392 0.695019
## monthsep 9.011e-01 4.901e-01 1.839 0.065967 .
## campaign -2.071e-02 1.836e-02 -1.128 0.259254
## pdays -1.024e-03 8.350e-04 -1.226 0.220035
## previous 8.477e-02 2.094e-01 0.405 0.685613
## poutcomenonexistent 5.872e-01 2.750e-01 2.135 0.032753 *
## poutcomesuccess 8.967e-01 8.324e-01 1.077 0.281411
## emp.var.rate -1.988e+00 3.409e-01 -5.832 5.46e-09 ***
## cons.price.idx 2.775e+00 5.839e-01 4.752 2.01e-06 ***
## cons.conf.idx 5.793e-02 2.068e-02 2.802 0.005082 **
## euribor3m 7.810e-02 3.047e-01 0.256 0.797719
## nr.employed 1.586e-02 6.913e-03 2.294 0.021786 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3214.1 on 2979 degrees of freedom
## AIC: 3256.1
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ contact + campaign + pdays + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed, family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6757 -0.8045 -0.5392 0.9808 2.0543
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.014e+02 3.388e+01 -2.994 0.00276 **
## contacttelephone -8.871e-01 1.277e-01 -6.946 3.76e-12 ***
## campaign -1.367e-02 1.800e-02 -0.760 0.44742
## pdays -9.384e-04 8.256e-04 -1.137 0.25571
## previous 1.421e-01 2.081e-01 0.683 0.49478
## poutcomenonexistent 8.014e-01 2.708e-01 2.960 0.00308 **
## poutcomesuccess 1.015e+00 8.271e-01 1.227 0.21983
## emp.var.rate -9.492e-01 1.573e-01 -6.036 1.58e-09 ***
## cons.price.idx 1.157e+00 2.250e-01 5.143 2.71e-07 ***
## cons.conf.idx 5.223e-02 1.324e-02 3.944 8.03e-05 ***
## euribor3m 1.930e-01 1.776e-01 1.086 0.27732
## nr.employed -1.009e-03 3.149e-03 -0.320 0.74875
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3295.0 on 2988 degrees of freedom
## AIC: 3319
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ contact + pdays + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed,
## family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6760 -0.8024 -0.5763 0.9853 2.0259
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.005e+02 3.387e+01 -2.967 0.00301 **
## contacttelephone -8.912e-01 1.276e-01 -6.987 2.81e-12 ***
## pdays -9.409e-04 8.257e-04 -1.139 0.25451
## previous 1.418e-01 2.080e-01 0.682 0.49523
## poutcomenonexistent 8.011e-01 2.707e-01 2.960 0.00308 **
## poutcomesuccess 1.013e+00 8.272e-01 1.225 0.22061
## emp.var.rate -9.590e-01 1.568e-01 -6.116 9.61e-10 ***
## cons.price.idx 1.155e+00 2.250e-01 5.133 2.85e-07 ***
## cons.conf.idx 5.187e-02 1.324e-02 3.918 8.94e-05 ***
## euribor3m 2.060e-01 1.769e-01 1.164 0.24432
## nr.employed -1.171e-03 3.143e-03 -0.373 0.70948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3295.5 on 2989 degrees of freedom
## AIC: 3317.5
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ contact + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed,
## family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6590 -0.8059 -0.5752 0.9858 2.0284
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.015e+02 3.387e+01 -2.998 0.00272 **
## contacttelephone -8.951e-01 1.276e-01 -7.018 2.25e-12 ***
## previous 2.466e-01 1.948e-01 1.266 0.20554
## poutcomenonexistent 8.970e-01 2.637e-01 3.401 0.00067 ***
## poutcomesuccess 1.885e+00 2.858e-01 6.595 4.25e-11 ***
## emp.var.rate -9.647e-01 1.566e-01 -6.159 7.30e-10 ***
## cons.price.idx 1.160e+00 2.250e-01 5.157 2.51e-07 ***
## cons.conf.idx 5.198e-02 1.325e-02 3.924 8.71e-05 ***
## euribor3m 2.136e-01 1.769e-01 1.208 0.22708
## nr.employed -1.271e-03 3.145e-03 -0.404 0.68598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3297.1 on 2990 degrees of freedom
## AIC: 3317.1
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ contact + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m + nr.employed, family = "binomial",
## data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7025 -0.8173 -0.5716 0.9871 2.0373
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.009e+02 3.388e+01 -2.979 0.00289 **
## contacttelephone -9.103e-01 1.272e-01 -7.154 8.42e-13 ***
## poutcomenonexistent 6.087e-01 1.361e-01 4.474 7.67e-06 ***
## poutcomesuccess 1.942e+00 2.828e-01 6.866 6.58e-12 ***
## emp.var.rate -9.652e-01 1.568e-01 -6.157 7.40e-10 ***
## cons.price.idx 1.182e+00 2.247e-01 5.258 1.46e-07 ***
## cons.conf.idx 5.246e-02 1.323e-02 3.965 7.35e-05 ***
## euribor3m 2.263e-01 1.765e-01 1.282 0.19992
## nr.employed -1.723e-03 3.128e-03 -0.551 0.58182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3298.8 on 2991 degrees of freedom
## AIC: 3316.8
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ contact + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m, family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7033 -0.8110 -0.5697 0.9841 2.0445
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.178e+02 1.465e+01 -8.039 9.08e-16 ***
## contacttelephone -9.235e-01 1.250e-01 -7.386 1.51e-13 ***
## poutcomenonexistent 6.076e-01 1.361e-01 4.465 8.01e-06 ***
## poutcomesuccess 1.948e+00 2.826e-01 6.893 5.44e-12 ***
## emp.var.rate -9.857e-01 1.526e-01 -6.461 1.04e-10 ***
## cons.price.idx 1.271e+00 1.545e-01 8.228 < 2e-16 ***
## cons.conf.idx 5.754e-02 9.497e-03 6.059 1.37e-09 ***
## euribor3m 1.537e-01 1.178e-01 1.304 0.192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3299.1 on 2992 degrees of freedom
## AIC: 3315.1
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = y ~ contact + poutcome + emp.var.rate + cons.conf.idx +
## cons.price.idx, family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6698 -0.7981 -0.5786 0.9835 1.9998
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.068e+02 1.200e+01 -8.903 < 2e-16 ***
## contacttelephone -8.733e-01 1.189e-01 -7.346 2.04e-13 ***
## poutcomenonexistent 6.184e-01 1.360e-01 4.548 5.42e-06 ***
## poutcomesuccess 1.931e+00 2.822e-01 6.842 7.79e-12 ***
## emp.var.rate -7.967e-01 4.555e-02 -17.490 < 2e-16 ***
## cons.conf.idx 6.046e-02 9.209e-03 6.565 5.19e-11 ***
## cons.price.idx 1.161e+00 1.294e-01 8.979 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3300.8 on 2993 degrees of freedom
## AIC: 3314.8
##
## Number of Fisher Scoring iterations: 5
## 2.5 % 97.5 %
## (Intercept) -130.79487834 -83.71427806
## contacttelephone -1.10795893 -0.64171154
## poutcomenonexistent 0.35273669 0.88611246
## poutcomesuccess 1.40193296 2.51426260
## emp.var.rate -0.88739499 -0.70873877
## cons.conf.idx 0.04258315 0.07870058
## cons.price.idx 0.91230508 1.41981182
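The eight manual fits above can be compared on AIC using only the printed numbers. For these binary-outcome fits the reported AIC equals the residual deviance plus twice the number of estimated coefficients, which the sketch below reproduces. The vectors are copied from the summaries in the order they appear; the variable names are mine.

```r
# Residual deviance and residual degrees of freedom copied from the eight
# manual fits above, in the order they appear.
resid_deviance <- c(3210.2, 3214.1, 3295.0, 3295.5, 3297.1, 3298.8, 3299.1, 3300.8)
resid_df       <- c(2968, 2979, 2988, 2989, 2990, 2991, 2992, 2993)
null_df        <- 2999

# Number of estimated coefficients, including the intercept.
n_coef <- null_df - resid_df + 1

# For these binomial fits the reported AIC equals residual deviance + 2 * k.
aic <- resid_deviance + 2 * n_coef
print(data.frame(model = seq_along(aic), n_coef, aic))

# Index of the fit with the smallest AIC.
best <- which.min(aic)
print(best)
```

By this measure the second fit (the one that still contains `month`) has the lowest AIC of the manual sequence, 3256.1, while the final seven-coefficient fit sits at 3314.8. Dropping `month` by intuition therefore costs fit as measured by AIC, even though it gives a much smaller model.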
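Since our goal is to predict the outcome, the duration variable is removed before fitting the stepwise full model, even though it looked significant in the full model: call duration is only known once the call is over, so it cannot be used to predict the outcome beforehand. In practice I do not believe that customer response depends on the month or day_of_week of the contact (the effect could be a coincidence), but for now both are kept in the model.
The equation below is chosen as the final model because it has the lowest AIC among the forward, backward and stepwise models (3240.412 for backward and stepwise versus 3276.491 for forward). The backward and stepwise procedures select exactly the same set of predictor variables.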
y ~ housing + contact + month + day_of_week + pdays + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + nr.employed
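The three selection directions can be sketched with `step()` from base R. The data below are simulated stand-ins (`x1` drives the outcome, `x2` and `x3` are noise), not the bank sample, and the object names are hypothetical. In the output that follows, the forward model contains every predictor and has the highest AIC; that is what `step()` returns when forward selection is started from the full model, since nothing is left to add. Whether that is what happened here is an assumption, so the sketch starts forward selection from an intercept-only model instead.

```r
set.seed(1)

# Simulated stand-in data: x1 drives the outcome, x2 and x3 are noise.
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x1))
sim <- data.frame(y, x1, x2, x3)

full_fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = sim)
null_fit <- glm(y ~ 1, family = binomial, data = sim)

# Backward: start from the full model and drop terms.
backward_fit <- step(full_fit, direction = "backward", trace = 0)

# Stepwise: start from the full model; terms may be dropped or re-added.
stepwise_fit <- step(full_fit, direction = "both", trace = 0)

# Forward: start from the intercept-only model and add terms,
# with the full formula as the upper scope.
forward_fit <- step(null_fit, direction = "forward",
                    scope = formula(full_fit), trace = 0)

aic_by_method <- c(forward  = AIC(forward_fit),
                   backward = AIC(backward_fit),
                   stepwise = AIC(stepwise_fit))
print(aic_by_method)
```

`step()` only accepts a move when it lowers AIC, so each selected model has an AIC no higher than its starting model.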
## [1] "Stepwise Model Details:"
##
## Call:
## glm(formula = y ~ housing + contact + month + day_of_week + pdays +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## nr.employed, family = "binomial", data = bank.additional.sample.train[-11])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7924 -0.8471 -0.5624 0.7967 1.9671
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.512e+02 7.392e+01 -4.752 2.02e-06 ***
## housingunknown -5.957e-01 2.981e-01 -1.999 0.045661 *
## housingyes -1.858e-01 8.868e-02 -2.095 0.036127 *
## contacttelephone -6.804e-01 1.681e-01 -4.047 5.19e-05 ***
## monthaug 7.722e-01 3.322e-01 2.325 0.020096 *
## monthdec -3.266e-01 5.235e-01 -0.624 0.532665
## monthjul 7.834e-02 2.195e-01 0.357 0.721181
## monthjun -1.116e+00 2.947e-01 -3.787 0.000153 ***
## monthmar 1.627e+00 3.570e-01 4.557 5.20e-06 ***
## monthmay -3.861e-01 1.865e-01 -2.071 0.038403 *
## monthnov -3.376e-01 2.171e-01 -1.555 0.119908
## monthoct 1.780e-01 3.188e-01 0.558 0.576521
## monthsep 9.149e-01 4.778e-01 1.915 0.055500 .
## day_of_weekmon -3.421e-01 1.424e-01 -2.403 0.016270 *
## day_of_weekthu 1.604e-01 1.378e-01 1.164 0.244428
## day_of_weektue 9.308e-02 1.414e-01 0.658 0.510471
## day_of_weekwed 7.870e-02 1.382e-01 0.569 0.569116
## pdays -1.159e-03 7.874e-04 -1.472 0.140889
## poutcomenonexistent 5.134e-01 1.440e-01 3.566 0.000363 ***
## poutcomesuccess 7.860e-01 8.078e-01 0.973 0.330524
## emp.var.rate -1.988e+00 3.377e-01 -5.888 3.91e-09 ***
## cons.price.idx 2.849e+00 5.387e-01 5.289 1.23e-07 ***
## cons.conf.idx 6.266e-02 1.349e-02 4.644 3.42e-06 ***
## nr.employed 1.704e-02 4.739e-03 3.594 0.000325 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3192.4 on 2976 degrees of freedom
## AIC: 3240.4
##
## Number of Fisher Scoring iterations: 5
## [1] "Step Model AIC:"
## [1] 3240.412
## Odds ratio 2.5 % 97.5 %
## (Intercept) 2.897335e-153 3.499620e-216 2.398704e-90
## housingunknown 5.511613e-01 3.072894e-01 9.885755e-01
## housingyes 8.304239e-01 6.979415e-01 9.880539e-01
## contacttelephone 5.064283e-01 3.642625e-01 7.040793e-01
## monthaug 2.164542e+00 1.128753e+00 4.150813e+00
## monthdec 7.213405e-01 2.585371e-01 2.012601e+00
## monthjul 1.081489e+00 7.033559e-01 1.662912e+00
## monthjun 3.276301e-01 1.838895e-01 5.837283e-01
## monthmar 5.087925e+00 2.527142e+00 1.024358e+01
## monthmay 6.796717e-01 4.715780e-01 9.795912e-01
## monthnov 7.134540e-01 4.661905e-01 1.091864e+00
## monthoct 1.194868e+00 6.396850e-01 2.231893e+00
## monthsep 2.496643e+00 9.787197e-01 6.368754e+00
## day_of_weekmon 7.102951e-01 5.373498e-01 9.389026e-01
## day_of_weekthu 1.173992e+00 8.961087e-01 1.538048e+00
## day_of_weektue 1.097545e+00 8.318350e-01 1.448129e+00
## day_of_weekwed 1.081880e+00 8.251246e-01 1.418531e+00
## pdays 9.988412e-01 9.973008e-01 1.000384e+00
## poutcomenonexistent 1.670974e+00 1.260140e+00 2.215751e+00
## poutcomesuccess 2.194611e+00 4.505936e-01 1.068882e+01
## emp.var.rate 1.369154e-01 7.063030e-02 2.654076e-01
## cons.price.idx 1.727765e+01 6.010540e+00 4.966561e+01
## cons.conf.idx 1.064666e+00 1.036878e+00 1.093198e+00
## nr.employed 1.017181e+00 1.007776e+00 1.026674e+00
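The odds-ratio table above follows from the stepwise coefficients by exponentiation. The sketch below redoes the arithmetic for three terms, with estimates and standard errors copied from the summary. The printed limits appear consistent with Wald intervals (estimate plus or minus 1.96 standard errors on the log-odds scale); that is an inference from the numbers, not something the source states.

```r
# Estimates and standard errors copied from the stepwise summary above.
estimate  <- c(contacttelephone = -0.6804, monthmar = 1.627, emp.var.rate = -1.988)
std_error <- c(contacttelephone =  0.1681, monthmar = 0.3570, emp.var.rate =  0.3377)

# Odds ratio = exp(coefficient); Wald 95% limits = exp(estimate +/- z * SE).
z <- qnorm(0.975)
odds_ratio <- exp(estimate)
lower <- exp(estimate - z * std_error)
upper <- exp(estimate + z * std_error)
print(round(cbind(odds_ratio, lower, upper), 4))
```

Read this way, a contact by telephone rather than cellular roughly halves the odds of a subscription (odds ratio about 0.51), holding the other predictors fixed.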
## GVIF Df GVIF^(1/(2*Df))
## housing 1.026534 2 1.006568
## contact 3.067376 1 1.751393
## month 44.211482 9 1.234294
## day_of_week 1.050005 4 1.006118
## pdays 10.030446 1 3.167088
## poutcome 11.923293 2 1.858228
## emp.var.rate 163.142930 1 12.772742
## cons.price.idx 53.864403 1 7.339237
## cons.conf.idx 2.335350 1 1.528185
## nr.employed 73.248971 1 8.558561
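The third column of the GVIF table is the GVIF raised to the power 1/(2*Df), which puts terms with different degrees of freedom on a comparable scale. The sketch below recomputes it from the first two columns (values copied from the table; the vector names are mine) and sorts the result.

```r
# GVIF and Df copied from the table above.
gvif <- c(housing = 1.026534, contact = 3.067376, month = 44.211482,
          day_of_week = 1.050005, pdays = 10.030446, poutcome = 11.923293,
          emp.var.rate = 163.142930, cons.price.idx = 53.864403,
          cons.conf.idx = 2.335350, nr.employed = 73.248971)
term_df <- c(housing = 2, contact = 1, month = 9, day_of_week = 4, pdays = 1,
             poutcome = 2, emp.var.rate = 1, cons.price.idx = 1,
             cons.conf.idx = 1, nr.employed = 1)

# Third column of the table: GVIF^(1 / (2 * Df)).
adjusted <- gvif^(1 / (2 * term_df))
print(round(sort(adjusted, decreasing = TRUE), 3))
```

The three largest values belong to `emp.var.rate`, `nr.employed` and `cons.price.idx`, which points to strong collinearity among the economic indicators. Any cut-off for "too high" is a rule of thumb and is not taken from the source.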
## [1] "Forward Model Details:"
## [1] "Forward Model AIC:"
## [1] 3276.491
##
## Call:
## glm(formula = y ~ age + job + marital + education + default +
## housing + loan + contact + month + day_of_week + campaign +
## pdays + previous + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m + nr.employed, family = "binomial",
## data = bank.additional.sample.train[-11])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7426 -0.8526 -0.5169 0.7791 1.9976
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.255e+02 8.812e+01 -3.694 0.000221 ***
## age -2.827e-03 5.364e-03 -0.527 0.598110
## jobblue-collar 1.936e-01 1.656e-01 1.169 0.242272
## jobentrepreneur 4.790e-02 2.562e-01 0.187 0.851692
## jobhousemaid 2.028e-01 3.088e-01 0.657 0.511378
## jobmanagement -1.241e-01 1.852e-01 -0.670 0.502542
## jobretired 3.352e-01 2.579e-01 1.300 0.193764
## jobself-employed 1.303e-01 2.675e-01 0.487 0.626233
## jobservices 2.295e-01 1.816e-01 1.264 0.206383
## jobstudent 1.812e-01 2.865e-01 0.632 0.527119
## jobtechnician -4.346e-02 1.541e-01 -0.282 0.777878
## jobunemployed 3.703e-01 2.931e-01 1.264 0.206389
## jobunknown -7.066e-02 4.999e-01 -0.141 0.887588
## maritalmarried 1.892e-01 1.449e-01 1.306 0.191436
## maritalsingle 1.085e-01 1.662e-01 0.653 0.513944
## maritalunknown 5.728e-01 9.439e-01 0.607 0.543907
## educationbasic.6y -1.761e-01 2.420e-01 -0.728 0.466658
## educationbasic.9y -1.378e-01 1.948e-01 -0.707 0.479377
## educationhigh.school 3.488e-02 1.945e-01 0.179 0.857658
## educationilliterate 1.298e+01 2.641e+02 0.049 0.960817
## educationprofessional.course 3.823e-02 2.175e-01 0.176 0.860467
## educationuniversity.degree 2.333e-01 1.981e-01 1.178 0.238819
## educationunknown 1.316e-01 2.586e-01 0.509 0.610940
## defaultunknown -1.383e-01 1.264e-01 -1.094 0.273960
## housingunknown -5.528e-01 3.018e-01 -1.832 0.066964 .
## housingyes -1.863e-01 8.947e-02 -2.082 0.037358 *
## loanunknown NA NA NA NA
## loanyes 3.657e-02 1.212e-01 0.302 0.762963
## contacttelephone -6.546e-01 1.696e-01 -3.859 0.000114 ***
## monthaug 7.271e-01 3.361e-01 2.164 0.030496 *
## monthdec -3.786e-01 5.391e-01 -0.702 0.482478
## monthjul 1.053e-01 2.226e-01 0.473 0.636164
## monthjun -1.046e+00 3.099e-01 -3.375 0.000738 ***
## monthmar 1.594e+00 3.797e-01 4.197 2.70e-05 ***
## monthmay -3.690e-01 1.911e-01 -1.931 0.053508 .
## monthnov -3.915e-01 2.687e-01 -1.457 0.145151
## monthoct 1.187e-01 3.706e-01 0.320 0.748747
## monthsep 8.660e-01 4.946e-01 1.751 0.079961 .
## day_of_weekmon -3.535e-01 1.434e-01 -2.466 0.013679 *
## day_of_weekthu 1.449e-01 1.390e-01 1.043 0.296987
## day_of_weektue 8.182e-02 1.429e-01 0.573 0.566843
## day_of_weekwed 6.741e-02 1.396e-01 0.483 0.629126
## campaign -1.860e-02 1.848e-02 -1.006 0.314334
## pdays -1.030e-03 8.419e-04 -1.224 0.221133
## previous 8.111e-02 2.106e-01 0.385 0.700062
## poutcomenonexistent 5.676e-01 2.768e-01 2.051 0.040277 *
## poutcomesuccess 8.656e-01 8.383e-01 1.032 0.301843
## emp.var.rate -1.943e+00 3.450e-01 -5.630 1.80e-08 ***
## cons.price.idx 2.682e+00 5.935e-01 4.519 6.21e-06 ***
## cons.conf.idx 5.701e-02 2.100e-02 2.715 0.006624 **
## euribor3m 9.952e-02 3.091e-01 0.322 0.747487
## nr.employed 1.491e-02 7.028e-03 2.122 0.033838 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3174.5 on 2949 degrees of freedom
## AIC: 3276.5
##
## Number of Fisher Scoring iterations: 12
## [1] "Backward Model Details:"
## [1] "Backward Model AIC:"
## [1] 3240.412
##
## Call:
## glm(formula = y ~ housing + contact + month + day_of_week + pdays +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## nr.employed, family = "binomial", data = bank.additional.sample.train[-11])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7924 -0.8471 -0.5624 0.7967 1.9671
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.512e+02 7.392e+01 -4.752 2.02e-06 ***
## housingunknown -5.957e-01 2.981e-01 -1.999 0.045661 *
## housingyes -1.858e-01 8.868e-02 -2.095 0.036127 *
## contacttelephone -6.804e-01 1.681e-01 -4.047 5.19e-05 ***
## monthaug 7.722e-01 3.322e-01 2.325 0.020096 *
## monthdec -3.266e-01 5.235e-01 -0.624 0.532665
## monthjul 7.834e-02 2.195e-01 0.357 0.721181
## monthjun -1.116e+00 2.947e-01 -3.787 0.000153 ***
## monthmar 1.627e+00 3.570e-01 4.557 5.20e-06 ***
## monthmay -3.861e-01 1.865e-01 -2.071 0.038403 *
## monthnov -3.376e-01 2.171e-01 -1.555 0.119908
## monthoct 1.780e-01 3.188e-01 0.558 0.576521
## monthsep 9.149e-01 4.778e-01 1.915 0.055500 .
## day_of_weekmon -3.421e-01 1.424e-01 -2.403 0.016270 *
## day_of_weekthu 1.604e-01 1.378e-01 1.164 0.244428
## day_of_weektue 9.308e-02 1.414e-01 0.658 0.510471
## day_of_weekwed 7.870e-02 1.382e-01 0.569 0.569116
## pdays -1.159e-03 7.874e-04 -1.472 0.140889
## poutcomenonexistent 5.134e-01 1.440e-01 3.566 0.000363 ***
## poutcomesuccess 7.860e-01 8.078e-01 0.973 0.330524
## emp.var.rate -1.988e+00 3.377e-01 -5.888 3.91e-09 ***
## cons.price.idx 2.849e+00 5.387e-01 5.289 1.23e-07 ***
## cons.conf.idx 6.266e-02 1.349e-02 4.644 3.42e-06 ***
## nr.employed 1.704e-02 4.739e-03 3.594 0.000325 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3192.4 on 2976 degrees of freedom
## AIC: 3240.4
##
## Number of Fisher Scoring iterations: 5
Performance comparison for simple logistic model
## Confusion Matrix and Statistics
##
##
## class.simple.logistics no yes
## no 411 208
## yes 76 305
##
## Accuracy : 0.716
## 95% CI : (0.6869, 0.7438)
## No Information Rate : 0.513
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : 7.639e-15
##
## Sensitivity : 0.5945
## Specificity : 0.8439
## Pos Pred Value : 0.8005
## Neg Pred Value : 0.6640
## Prevalence : 0.5130
## Detection Rate : 0.3050
## Detection Prevalence : 0.3810
## Balanced Accuracy : 0.7192
##
## 'Positive' Class : yes
##
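The statistics in the confusion-matrix output above can be reproduced by hand from the four counts, which makes it easier to see what each one measures. A minimal base-R sketch, with the counts copied from the matrix and `yes` as the positive class:

```r
# Counts copied from the confusion matrix above:
# rows = predicted class, columns = actual class.
cm <- matrix(c(411, 76, 208, 305), nrow = 2,
             dimnames = list(predicted = c("no", "yes"),
                             actual    = c("no", "yes")))

tp <- cm["yes", "yes"]; tn <- cm["no", "no"]
fp <- cm["yes", "no"];  fn <- cm["no", "yes"]

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),   # share of all cases classified correctly
  sensitivity = tp / (tp + fn),        # share of actual "yes" that were found
  specificity = tn / (tn + fp),        # share of actual "no" that were found
  ppv         = tp / (tp + fp),        # share of predicted "yes" that are right
  npv         = tn / (tn + fn)         # share of predicted "no" that are right
)
print(round(metrics, 4))
```

The model misses about 40% of the actual subscribers (sensitivity near 0.59) while being right about 80% of the time when it does predict `yes`.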
The duration variable is also removed before fitting LASSO, so that its performance can be compared with the stepwise model.
## 54 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -78.261787018
## (Intercept) .
## age -0.001637056
## jobblue-collar 0.087205824
## jobentrepreneur .
## jobhousemaid 0.061354472
## jobmanagement -0.129059410
## jobretired 0.243481422
## jobself-employed 0.068456428
## jobservices 0.143912659
## jobstudent 0.150250085
## jobtechnician -0.065976214
## jobunemployed 0.267901150
## jobunknown .
## maritalmarried 0.113492457
## maritalsingle 0.029054379
## maritalunknown 0.293777788
## educationbasic.6y -0.150084551
## educationbasic.9y -0.117252962
## educationhigh.school .
## educationilliterate 2.737057683
## educationprofessional.course .
## educationuniversity.degree 0.158908707
## educationunknown 0.080193552
## defaultunknown -0.132160923
## defaultyes .
## housingunknown -0.406167726
## housingyes -0.169472352
## loanunknown -0.079018009
## loanyes 0.005044221
## contacttelephone -0.416196425
## monthaug .
## monthdec -0.504244424
## monthjul 0.076878962
## monthjun -0.297748659
## monthmar 1.000946604
## monthmay -0.631007335
## monthnov -0.420164077
## monthoct 0.149219311
## monthsep 0.101023272
## day_of_weekmon -0.373512545
## day_of_weekthu 0.087494087
## day_of_weektue 0.019468612
## day_of_weekwed 0.005810514
## campaign -0.019773486
## pdays -0.001213301
## previous .
## poutcomenonexistent 0.433379460
## poutcomesuccess 0.597784166
## emp.var.rate -0.725772662
## cons.price.idx 0.866413405
## cons.conf.idx 0.045285626
## euribor3m .
## nr.employed .
## [1] "CV Error Rate:"
## [1] 0.261
## [1] "Penalty Value:"
## [1] 0.001412619
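The coefficient listing above has two `(Intercept)` rows, the second one empty. One common cause, assumed here rather than confirmed by the source, is passing the full `model.matrix()` output, which carries its own intercept column, to a fitting function such as `glmnet` that adds an intercept itself; the redundant column then gets no coefficient. The base-R sketch below shows where the extra column comes from and how it is usually dropped. The toy data frame is hypothetical.

```r
# Hypothetical toy data (not the real sample).
toy <- data.frame(
  y       = factor(c("no", "yes", "no", "yes")),
  age     = c(30, 45, 52, 38),
  contact = factor(c("cellular", "telephone", "cellular", "telephone"))
)

# model.matrix() expands factors into dummy columns and adds "(Intercept)".
x_with_intercept <- model.matrix(y ~ ., data = toy)
print(colnames(x_with_intercept))

# Dropping the first column leaves only the predictors.
x <- x_with_intercept[, -1, drop = FALSE]
print(colnames(x))
```

The duplicate row is harmless for prediction, but removing it keeps the coefficient table at one row per term.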
Confusion matrix for LASSO
## [1] "Confusion matrix for LASSO"
## Confusion Matrix and Statistics
##
##
## class.lasso no yes
## no 406 204
## yes 81 309
##
## Accuracy : 0.715
## 95% CI : (0.6859, 0.7428)
## No Information Rate : 0.513
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4332
##
## Mcnemar's Test P-Value : 4.95e-13
##
## Sensitivity : 0.6023
## Specificity : 0.8337
## Pos Pred Value : 0.7923
## Neg Pred Value : 0.6656
## Prevalence : 0.5130
## Detection Rate : 0.3090
## Detection Prevalence : 0.3900
## Balanced Accuracy : 0.7180
##
## 'Positive' Class : yes
##
## [1] "Confusion matrix for Stepwise"
## Confusion Matrix and Statistics
##
##
## class.step no yes
## no 411 208
## yes 76 305
##
## Accuracy : 0.716
## 95% CI : (0.6869, 0.7438)
## No Information Rate : 0.513
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : 7.639e-15
##
## Sensitivity : 0.5945
## Specificity : 0.8439
## Pos Pred Value : 0.8005
## Neg Pred Value : 0.6640
## Prevalence : 0.5130
## Detection Rate : 0.3050
## Detection Prevalence : 0.3810
## Balanced Accuracy : 0.7192
##
## 'Positive' Class : yes
##
## [1] "Overall accuracy for LASSO and Stepwise respectively"
## [1] 0.715
## [1] 0.716
Fitting a complex model using four predictors and one interaction term (cons.price.idx * cons.conf.idx)
##
## Call:
## glm(formula = y ~ euribor3m + cons.price.idx + poutcome + cons.conf.idx +
## cons.price.idx * cons.conf.idx, family = "binomial", data = bank.additional.sample.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7452 -0.8067 -0.6823 1.0055 1.7825
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 254.48918 56.44127 4.509 6.52e-06 ***
## euribor3m -0.55084 0.03446 -15.986 < 2e-16 ***
## cons.price.idx -2.70092 0.60665 -4.452 8.50e-06 ***
## poutcomenonexistent 0.53884 0.13494 3.993 6.52e-05 ***
## poutcomesuccess 1.98992 0.28157 7.067 1.58e-12 ***
## cons.conf.idx 6.91006 1.37533 5.024 5.05e-07 ***
## cons.price.idx:cons.conf.idx -0.07368 0.01478 -4.986 6.16e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4158.7 on 2999 degrees of freedom
## Residual deviance: 3347.4 on 2993 degrees of freedom
## AIC: 3361.4
##
## Number of Fisher Scoring iterations: 5
## 2.5 % 97.5 %
## (Intercept) 145.6665772 367.12728215
## euribor3m -0.6193081 -0.48416238
## cons.price.idx -3.9113653 -1.53106520
## poutcomenonexistent 0.2750174 0.80429988
## poutcomesuccess 1.4625421 2.57243656
## cons.conf.idx 4.2661557 9.66177624
## cons.price.idx:cons.conf.idx -0.1032418 -0.04526933
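With the interaction in the model, the coefficient of `cons.conf.idx` (6.91006) is no longer its effect on its own: the change in log-odds per unit of `cons.conf.idx` depends on the value of `cons.price.idx`. The sketch below works this out from the two coefficients in the summary above. The price index values used for illustration are assumed, not taken from the source.

```r
# Coefficients copied from the interaction model summary above.
b_conf        <- 6.91006    # cons.conf.idx main effect
b_interaction <- -0.07368   # cons.price.idx:cons.conf.idx

# Change in log-odds per one-unit increase in cons.conf.idx,
# evaluated at a given consumer price index.
conf_slope <- function(price_idx) b_conf + b_interaction * price_idx

# Illustrative price index values (assumed, not taken from the source).
price_values <- c(92.5, 93.5, 94.5)
print(data.frame(price_idx = price_values, slope = conf_slope(price_values)))

# Price index at which the slope changes sign.
sign_change <- -b_conf / b_interaction
print(sign_change)
```

The large main-effect coefficients therefore mostly cancel against the interaction term: over this range of price index values the net slope is small, and it changes sign near a price index of about 93.8.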
Performance of the complex model (logistic model with interaction term)
## Confusion Matrix and Statistics
##
##
## class.complex.interaction no yes
## no 377 197
## yes 110 316
##
## Accuracy : 0.693
## 95% CI : (0.6634, 0.7215)
## No Information Rate : 0.513
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3884
##
## Mcnemar's Test P-Value : 9.188e-07
##
## Sensitivity : 0.6160
## Specificity : 0.7741
## Pos Pred Value : 0.7418
## Neg Pred Value : 0.6568
## Prevalence : 0.5130
## Detection Rate : 0.3160
## Detection Prevalence : 0.4260
## Balanced Accuracy : 0.6951
##
## 'Positive' Class : yes
##
LDA
## Confusion Matrix and Statistics
##
##
## bank.additional.lda.p no yes
## no 404 108
## yes 83 405
##
## Accuracy : 0.809
## 95% CI : (0.7832, 0.8329)
## No Information Rate : 0.513
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6182
##
## Mcnemar's Test P-Value : 0.08246
##
## Sensitivity : 0.7895
## Specificity : 0.8296
## Pos Pred Value : 0.8299
## Neg Pred Value : 0.7891
## Prevalence : 0.5130
## Detection Rate : 0.4050
## Detection Prevalence : 0.4880
## Balanced Accuracy : 0.8095
##
## 'Positive' Class : yes
##
Random Forest
##
## Call:
## randomForest(formula = y ~ ., data = bank.additional.sample.train[-11], importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 25.53%
## Confusion matrix:
## no yes class.error
## no 1253 260 0.1718440
## yes 506 981 0.3402824
## Confusion Matrix and Statistics
##
##
## bank.rf.pred no yes
## no 409 199
## yes 78 314
##
## Accuracy : 0.723
## 95% CI : (0.6941, 0.7505)
## No Information Rate : 0.513
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4491
##
## Mcnemar's Test P-Value : 5.591e-13
##
## Sensitivity : 0.6121
## Specificity : 0.8398
## Pos Pred Value : 0.8010
## Neg Pred Value : 0.6727
## Prevalence : 0.5130
## Detection Rate : 0.3140
## Detection Prevalence : 0.3920
## Balanced Accuracy : 0.7260
##
## 'Positive' Class : yes
##
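The out-of-bag (OOB) figures in the random forest printout above can be checked directly from its confusion matrix, which is computed on the 3000 training rows rather than the test set. A short base-R sketch with the counts copied from the printout:

```r
# OOB confusion matrix copied from the randomForest printout above:
# rows = actual class, columns = predicted class.
oob <- matrix(c(1253, 506, 260, 981), nrow = 2,
              dimnames = list(actual = c("no", "yes"),
                              predicted = c("no", "yes")))

# Per-class error: misclassified share within each actual class.
class_error <- c(no  = oob["no", "yes"] / sum(oob["no", ]),
                 yes = oob["yes", "no"] / sum(oob["yes", ]))

# Overall OOB error: all off-diagonal counts over the training size.
oob_error <- (oob["no", "yes"] + oob["yes", "no"]) / sum(oob)
print(round(class_error, 4))
print(round(oob_error, 4))

# Default mtry for classification is floor(sqrt(p)); with 19 predictors
# left after dropping column 11, this gives the 4 reported above.
print(floor(sqrt(19)))
```

The OOB error of 25.53% is close to the test-set misclassification rate of 27.7%, and in both cases the forest is noticeably worse at finding `yes` cases than `no` cases.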
KNN
After running 50 iterations on the same sample data set and 100 iterations on different train/test splits, we see that the model performs best at K = 10.
## classifications
## no yes
## no 416 71
## yes 225 288
## Confusion Matrix and Statistics
##
## classifications
## no yes
## no 416 71
## yes 225 288
##
## Accuracy : 0.704
## 95% CI : (0.6746, 0.7322)
## No Information Rate : 0.641
## P-Value [Acc > NIR] : 1.475e-05
##
## Kappa : 0.4123
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8022
## Specificity : 0.6490
## Pos Pred Value : 0.5614
## Neg Pred Value : 0.8542
## Prevalence : 0.3590
## Detection Rate : 0.2880
## Detection Prevalence : 0.5130
## Balanced Accuracy : 0.7256
##
## 'Positive' Class : yes
##
## Accuracy
## 0.704
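The KNN statistics above report a No Information Rate of 0.641 and a prevalence of 0.359, whereas every other model on the same test set reports 0.513. The row totals of the KNN table (487 and 513) match the actual class counts of the other matrices, which suggests that the table has actual classes in rows and predictions in columns, the reverse of the other matrices. If so, the class-specific statistics are swapped while the accuracy is unaffected. This is an inference from the numbers, not something the source states; the sketch below recomputes the metrics under that assumption.

```r
# Table copied from the KNN output above. Rows are taken to be the
# actual classes and columns the predictions (an assumption, see text).
knn_tab <- matrix(c(416, 225, 71, 288), nrow = 2,
                  dimnames = list(actual = c("no", "yes"),
                                  predicted = c("no", "yes")))

tp <- knn_tab["yes", "yes"]; fn <- knn_tab["yes", "no"]
tn <- knn_tab["no", "no"];   fp <- knn_tab["no", "yes"]

knn_metrics <- c(
  accuracy    = (tp + tn) / sum(knn_tab),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),
  npv         = tn / (tn + fn)
)
print(round(knn_metrics, 4))
```

Under this reading, KNN has a sensitivity of about 0.56 and a specificity of about 0.85, which is in line with the logistic and random forest results rather than the reverse pattern printed above.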
Comparison of models
## [1] "Simple model Overall Accuracy:"
## Accuracy
## 0.716
## [1] "Complex model Overall Accuracy:"
## Accuracy
## 0.693
## [1] "Random Forest Overall Accuracy:"
## Accuracy
## 0.723
## [1] "LDA Overall Accuracy:"
## Accuracy
## 0.809
## [1] "KNN Overall Accuracy:"
## Accuracy
## 0.704
## [1] "Simple model Misclassification rate:"
## [1] 0.284
## [1] "Complex model Misclassification rate:"
## [1] 0.307
## [1] "RandomForest Misclassification rate:"
## [1] 0.277
## [1] "LDA Misclassification rate:"
## [1] 0.191
## [1] "KNN Misclassification rate:"
## [1] 0.296
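The accuracies and misclassification rates above carry the same information, since each misclassification rate is one minus the accuracy. A small sketch that collects them into a single sorted table; the values are copied from the output and the object names are mine:

```r
# Test-set accuracies copied from the output above.
accuracy <- c(simple = 0.716, complex = 0.693, random_forest = 0.723,
              lda = 0.809, knn = 0.704)

# Misclassification rate is the complement of accuracy.
comparison <- data.frame(accuracy = accuracy, misclassification = 1 - accuracy)

# Sort from most to least accurate.
comparison <- comparison[order(-comparison$accuracy), ]
print(comparison)
```

On this test split LDA ranks first with an accuracy of 0.809, followed by the random forest (0.723), the simple logistic model (0.716), KNN (0.704) and the complex model with the interaction term (0.693).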